🧩 Visualization - Reusable Functions for Impact Analysis Part I¶
This section provides reusable, parameterized functions for analyzing and visualizing performance metrics across temporal and categorical dimensions. Designed for flexibility and clarity, the functions support:
Dynamic grouping by time (year, month_name, day_name) or category (store, promo, etc.)
Preprocessing filters to exclude non-operational records (e.g., closed stores, zero-sales days)
Statistical summaries including mean, standard deviation, and count
Ranked insights with volatility and performance differentials
Interactive visualizations via Plotly for enhanced interpretability
These tools enable scalable impact assessments and trend analyses across diverse datasets with minimal code repetition.
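The shared pattern behind all of these functions — drop non-operational rows, group, summarize — can be sketched as one small helper. This is a minimal illustration, not part of the notebook's pipeline; the column names (`sales`, `day`) mirror the Rossmann schema, and the `demo` frame is invented for demonstration.

```python
import pandas as pd

def grouped_summary(df, group_col, target_col="sales"):
    """Mean/std/count per group, excluding zero-value (closed-store) rows."""
    open_rows = df[df[target_col] > 0]
    stats = open_rows.groupby(group_col)[target_col].agg(["mean", "std", "count"])
    return stats.sort_values("mean", ascending=False)

demo = pd.DataFrame({
    "day":   ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "sales": [100, 0, 200, 300, 250],  # the 0 row represents a closed store
})
summary = grouped_summary(demo, "day")
```

Every function below follows this skeleton, adding its own visualization and printed analysis on top.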
1. Setup & Import Libraries¶
import time
# Step 1: Setup & Import Libraries
print("Step 1: Setup and Import Libraries started...")
time.sleep(1) # Simulate processing time
Step 1: Setup and Import Libraries started...
# Data Manipulation & Processing
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *
# Data Visualization
import seaborn as sbn
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pandas.plotting import scatter_matrix
# to ensure Plotly works in both Jupyter and HTML export
pio.renderers.default = "notebook+plotly_mimetype"
sbn.set(rc={'figure.figsize':(14,6)})
plt.style.use('seaborn-v0_8')
sbn.set_palette("husl")
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)
# Warnings
import warnings
warnings.filterwarnings('ignore')
print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 2")
print("="*60)
print("All libraries imported successfully!")
print("Analysis Date:", pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'))
============================================================
Rossmann Store Sales Time Series Analysis - Part 2
============================================================
All libraries imported successfully!
Analysis Date: 2025-08-16 01:02:27
print("✅ Setup and Import Libraries completed.\n")
✅ Setup and Import Libraries completed.
# Start Impact Analysis
viz_impact_analysis_begin = pd.Timestamp.now()
bold_start = '\033[1m'
bold_end = '\033[0m'
print("🔍 Viz Impact Analysis Started ...")
print(f"🟢 Begin Date: {bold_start}{viz_impact_analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")
🔍 Viz Impact Analysis Started ...
🟢 Begin Date: 2025-08-16 01:02:27
Restore the Stored DataFrame¶
%store -r df_viz_feat
Preview the Dataset¶
print("\nTrain Data Preview:")
print("\n",df_viz_feat.head())
Train Data Preview:
store dayofweek date sales customers open promo stateholiday schoolholiday day week month quarter year isweekend isholiday isschoolDay
982643 1115 2 2013-01-01 0 0 0 No Promo Public 1 Tue 1 Jan 1 2013 False True False
982640 1112 2 2013-01-01 0 0 0 No Promo Public 1 Tue 1 Jan 1 2013 False True False
982639 1111 2 2013-01-01 0 0 0 No Promo Public 1 Tue 1 Jan 1 2013 False True False
982638 1110 2 2013-01-01 0 0 0 No Promo Public 1 Tue 1 Jan 1 2013 False True False
982637 1109 2 2013-01-01 0 0 0 No Promo Public 1 Tue 1 Jan 1 2013 False True False
# Step 2: Feature Engineering
print("Step 2: Feature Engineering started...")
time.sleep(1)  # Simulate processing time
Step 2: Feature Engineering started...
# Make a copy of the original dataframe to avoid modifying it
df_features = df_viz_feat.copy()
Promotion Impact Analysis¶
def analyze_promotion_impact(df, target_col='sales', category_col='day', width=1200, height=500):
    """
    Analyze promotion impact across any category (day, month, store, etc.)

    Parameters:
    - df: DataFrame with 'promo' column
    - target_col: metric to analyze ('sales' or 'customers')
    - category_col: grouping category ('day', 'month', 'store', etc.)
    """
    # Filter out closed stores (e.g., sales = 0) for an unbiased, fair comparison
    df = df[df[target_col] > 0].copy()

    # Create summary data
    summary = df.groupby([category_col, 'promo'])[target_col].mean().reset_index()

    # Visualization
    fig = px.bar(
        summary,
        x=category_col,
        y=target_col,
        color='promo',
        title=f'Promotion Impact: {target_col.title()} by {category_col.title()}',
        color_discrete_map={'Promo': '#636EFA', 'No Promo': '#EF553B'},
        barmode='group'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})

    # Impact analysis
    currency = "€" if target_col == 'sales' else ""
    print(f"Promotion Impact Analysis - {target_col.title()} by {category_col.title()}:")
    print("=" * 60)
    categories = sorted(df[category_col].unique())
    for category in categories:
        no_promo = df[(df[category_col] == category) & (df['promo'] == 'No Promo')][target_col].mean()
        promo = df[(df[category_col] == category) & (df['promo'] == 'Promo')][target_col].mean()
        if not pd.isna(no_promo) and not pd.isna(promo):
            lift = ((promo - no_promo) / no_promo) * 100
            print(f"{str(category):12}: No Promo {currency}{no_promo:6,.0f} | Promo {currency}{promo:6,.0f} | Lift {lift:+5.1f}%")

    # Overall summary
    overall_no_promo = df[df['promo'] == 'No Promo'][target_col].mean()
    overall_promo = df[df['promo'] == 'Promo'][target_col].mean()
    overall_lift = ((overall_promo - overall_no_promo) / overall_no_promo) * 100
    print(f"\nOverall Impact:")
    print(f"Average lift from promotions: {overall_lift:+.1f}%")
    print(f"Additional revenue per day: {currency}{overall_promo - overall_no_promo:,.0f}")

    return summary
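The per-category loop above can also be expressed without explicit iteration via `pivot_table`. A hedged sketch, assuming the same `'Promo'` / `'No Promo'` labels used in this notebook; the `demo` frame is a toy example:

```python
import pandas as pd

def promo_lift_table(df, target_col="sales", category_col="day"):
    """Per-category promo lift (%) computed vectorized.
    Assumes a 'promo' column with 'Promo' / 'No Promo' labels."""
    open_rows = df[df[target_col] > 0]  # same closed-store filter as above
    pivot = open_rows.pivot_table(index=category_col, columns="promo",
                                  values=target_col, aggfunc="mean")
    pivot["lift_pct"] = (pivot["Promo"] - pivot["No Promo"]) / pivot["No Promo"] * 100
    return pivot

demo = pd.DataFrame({
    "day":   ["Mon", "Mon", "Tue", "Tue"],
    "promo": ["No Promo", "Promo", "No Promo", "Promo"],
    "sales": [100, 150, 200, 220],
})
lift = promo_lift_table(demo)
```

The vectorized form trades the formatted per-row printout for a DataFrame that can feed directly into further analysis or plotting.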
Top Performer Impact Analysis¶
def analyze_top_performers(df, group_col='store', target_col='sales', top_n=10, width=1200, height=500):
    """
    Analyze top performing entities (stores, days, months, etc.)

    Parameters:
    - df: DataFrame
    - group_col: column to group by ('store', 'day', 'month', etc.)
    - target_col: metric to analyze ('sales' or 'customers')
    - top_n: number of top performers to show
    """
    # Filter out closed stores (e.g., sales = 0) for an unbiased, fair comparison
    df = df[df[target_col] > 0].copy()

    # Calculate averages and get top performers
    group_avg = df.groupby(group_col)[target_col].mean()
    top_performers = group_avg.nlargest(top_n)

    # Visualization
    fig = px.bar(
        x=top_performers.index.astype(str),
        y=top_performers.values,
        title=f'Top {top_n} {group_col.title()} by Average {target_col.title()}',
        labels={'x': group_col.title(), 'y': f'Average {target_col.title()}'}
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})

    # Performance analysis (currency defined up front so the summary below works
    # even if the loop is skipped)
    currency = "€" if target_col == 'sales' else ""
    print(f"Top {top_n} {group_col.title()} Performance Analysis:")
    print("=" * 55)
    print(f"{'Rank':<4} {group_col.title():<10} {'Average':<15} {'% of #1':<10}")
    print("-" * 55)
    for i, (entity, avg_value) in enumerate(top_performers.items(), 1):
        pct_of_top = (avg_value / top_performers.iloc[0]) * 100
        print(f"{i:<4} {str(entity):<10} {currency}{avg_value:>9,.0f} {pct_of_top:>6.1f}%")

    # Summary statistics
    total_entities = len(group_avg)
    top_avg = top_performers.mean()
    overall_avg = group_avg.mean()
    performance_gap = ((top_avg - overall_avg) / overall_avg) * 100
    print(f"\nSummary Statistics:")
    print(f"Total {group_col}s analyzed: {total_entities:,}")
    print(f"Top {top_n} average: {currency}{top_avg:,.0f}")
    print(f"Overall average: {currency}{overall_avg:,.0f}")
    print(f"Top {top_n} outperform by: {performance_gap:.1f}%")

    return top_performers
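The "% of #1" column is simply each top-n mean divided by the leader's mean. A standalone sketch of that calculation (the sample averages are illustrative, loosely echoing the store figures reported later in this notebook):

```python
import pandas as pd

def top_performers_pct(series, top_n=3):
    """Top-n group means, each expressed as a % of the leader."""
    top = series.nlargest(top_n)
    return pd.DataFrame({"avg": top, "pct_of_top": top / top.iloc[0] * 100})

# Hypothetical per-store average sales (store id -> mean sales)
avgs = pd.Series({817: 21789.0, 262: 20684.0, 1114: 20576.0, 251: 19158.0})
ranked = top_performers_pct(avgs, top_n=3)
```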
Temporal Trends Impact Analysis¶
def analyze_temporal_trends(df, time_col='month', target_col='sales', width=1200, height=500):
    """
    Analyze trends over time periods (month, day, year, etc.)

    Parameters:
    - df: DataFrame
    - time_col: time dimension ('month', 'day', 'year', etc.)
    - target_col: metric to analyze ('sales' or 'customers')
    """
    # Filter out closed stores (e.g., sales = 0) for an unbiased, fair comparison
    df = df[df[target_col] > 0].copy()

    # Calculate averages by time period
    time_stats = df.groupby(time_col).agg({
        target_col: ['mean', 'std', 'count']
    }).round(0)
    time_stats.columns = ['avg', 'std', 'count']
    time_stats = time_stats.reset_index().sort_values('avg', ascending=False)

    # Visualization
    fig = px.bar(
        time_stats,
        x=time_col,
        y='avg',
        title=f'{target_col.title()} Performance by {time_col.title()}',
        color='avg',
        color_continuous_scale='viridis'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})

    # Trend analysis
    currency = "€" if target_col == 'sales' else ""
    print(f"{time_col.title()} Performance Analysis:")
    print("=" * 50)
    print(f"{'Rank':<4} {time_col.title():<12} {'Average':<15} {'Std Dev':<10} {'Count':<8}")
    print("-" * 60)
    # Rank from 1 in sorted order (the pre-sort DataFrame index would print
    # ranks out of order)
    for rank, (_, row) in enumerate(time_stats.iterrows(), 1):
        print(f"{rank:<4} {str(row[time_col]):<12} {currency}{row['avg']:>9,.0f} {currency}{row['std']:>6,.0f} {row['count']:>6,.0f}")

    # Key insights
    best_period = time_stats.iloc[0][time_col]
    worst_period = time_stats.iloc[-1][time_col]
    best_value = time_stats.iloc[0]['avg']
    worst_value = time_stats.iloc[-1]['avg']
    volatility = ((best_value - worst_value) / time_stats['avg'].mean()) * 100
    print(f"\nKey Insights:")
    print(f"Best {time_col}: {best_period} ({currency}{best_value:,.0f})")
    print(f"Worst {time_col}: {worst_period} ({currency}{worst_value:,.0f})")
    print(f"Performance range: {currency}{best_value - worst_value:,.0f}")
    print(f"Volatility: {volatility:.1f}%")

    return time_stats
State Holiday Impact Analysis¶
def analyze_stateholiday_impact(df, target_col='sales', category_col='day', width=1200, height=500):
    """
    Analyze the impact of state holidays across a specified category (e.g., day, month, store).

    Parameters:
    - df: DataFrame with 'stateholiday' column already mapped to labels
    - target_col: metric to analyze ('sales' or 'customers')
    - category_col: grouping category ('day', 'month', 'store', etc.)
    """
    # Filter out closed stores (e.g., sales = 0) for an unbiased, fair comparison
    df_open = df[df[target_col] > 0].copy()

    # Create summary data
    summary = df_open.groupby([category_col, 'stateholiday'])[target_col].mean().reset_index()

    # Visualization
    fig = px.bar(
        summary,
        x=category_col,
        y=target_col,
        color='stateholiday',
        title=f'State Holiday Impact: {target_col.title()} by {category_col.title()}',
        barmode='group'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})

    # Impact analysis
    currency = "€" if target_col == 'sales' else ""
    print(f"State Holiday Impact Analysis - {target_col.title()} by {category_col.title()}:")
    print("=" * 70)

    # Get regular day baseline
    regular_avg = df_open[df_open['stateholiday'] == 'Normal Day'][target_col].mean()

    # Overall holiday impact
    print("Overall Holiday Impact:")
    print("-" * 25)
    for holiday_label in ['Public', 'Easter', 'Christmas']:
        if holiday_label in df_open['stateholiday'].values:
            holiday_avg = df_open[df_open['stateholiday'] == holiday_label][target_col].mean()
            if not pd.isna(holiday_avg):
                impact = ((holiday_avg - regular_avg) / regular_avg) * 100
                print(f"{holiday_label:15}: {currency}{holiday_avg:6,.0f} ({impact:+5.1f}% vs regular)")
    print(f"Regular Days: {currency}{regular_avg:6,.0f} (baseline)")

    # Category-wise analysis
    print(f"\nHoliday Impact by {category_col.title()}:")
    print("-" * 35)
    categories = sorted(df[category_col].unique())
    for category in categories:
        regular = df_open[(df_open[category_col] == category) & (df_open['stateholiday'] == 'Normal Day')][target_col].mean()
        holiday = df_open[(df_open[category_col] == category) & (df_open['stateholiday'] != 'Normal Day')][target_col].mean()
        if not pd.isna(regular):
            if not pd.isna(holiday):
                impact = ((holiday - regular) / regular) * 100
                print(f"{str(category):12}: Regular {currency}{regular:6,.0f} | Holiday {currency}{holiday:6,.0f} | Impact {impact:+5.1f}%")
            else:
                print(f"{str(category):12}: Regular {currency}{regular:6,.0f} | No holiday data")

    # Store closure analysis (uses the unfiltered frame, where target == 0 marks a closure)
    total_records = len(df)
    closed_stores = len(df[df[target_col] == 0])
    holiday_closures = len(df[(df['stateholiday'] != 'Normal Day') & (df[target_col] == 0)])
    regular_closures = len(df[(df['stateholiday'] == 'Normal Day') & (df[target_col] == 0)])
    print(f"\nStore Operations Impact:")
    print("-" * 25)
    print(f"Total store closures: {closed_stores:,} ({(closed_stores/total_records)*100:.1f}%)")
    print(f"Holiday closures: {holiday_closures:,}")
    print(f"Regular closures: {regular_closures:,}")

    return summary
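The closure accounting at the end of that function reduces to one boolean split over the zero-sales rows. A minimal sketch, assuming the `'Normal Day'` label convention used in this notebook; the `demo` frame is invented:

```python
import pandas as pd

def closure_breakdown(df, target_col="sales", holiday_col="stateholiday"):
    """Count closed-store records (target == 0) on holidays vs. regular days.
    Assumes holiday_col uses the label 'Normal Day' for non-holidays."""
    closed = df[df[target_col] == 0]
    is_holiday = closed[holiday_col] != "Normal Day"
    return {
        "total_closures": len(closed),
        "holiday_closures": int(is_holiday.sum()),
        "regular_closures": int((~is_holiday).sum()),
    }

demo = pd.DataFrame({
    "sales": [0, 0, 0, 500],
    "stateholiday": ["Public", "Normal Day", "Normal Day", "Normal Day"],
})
counts = closure_breakdown(demo)
```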
RUNNING TOP 4 MOST IMPORTANT ANALYSES¶
# Example usage:
if __name__ == "__main__":
    # Assuming df_viz_feat is your dataset
    print("=== RUNNING TOP 4 MOST IMPORTANT ANALYSES ===\n")

    # 1. Promotion Impact Analysis
    print("1. PROMOTION IMPACT ANALYSIS")
    print("-" * 40)
    promo_results = analyze_promotion_impact(df_viz_feat, target_col='sales', category_col='day')
    print("\n" + "="*80 + "\n")

    # 2. State Holiday Impact Analysis
    print("2. STATE HOLIDAY IMPACT ANALYSIS")
    print("-" * 40)
    holiday_results = analyze_stateholiday_impact(df_viz_feat, target_col='sales', category_col='day')
    print("\n" + "="*80 + "\n")

    # 3. Top Store Performance Analysis
    print("3. TOP STORE PERFORMANCE ANALYSIS")
    print("-" * 40)
    store_results = analyze_top_performers(df_viz_feat, group_col='store', target_col='sales', top_n=10)
    print("\n" + "="*80 + "\n")

    # 4. Temporal Trends Analysis
    print("4. TEMPORAL TRENDS ANALYSIS")
    print("-" * 40)
    trend_results = analyze_temporal_trends(df_viz_feat, time_col='month', target_col='sales')
=== RUNNING TOP 4 MOST IMPORTANT ANALYSES ===

1. PROMOTION IMPACT ANALYSIS
----------------------------------------
Promotion Impact Analysis - Sales by Day:
============================================================
Fri         : No Promo € 6,352 | Promo € 7,734 | Lift +21.8%
Mon         : No Promo € 6,205 | Promo € 9,776 | Lift +57.5%
Thu         : No Promo € 5,737 | Promo € 7,595 | Lift +32.4%
Tue         : No Promo € 5,716 | Promo € 8,277 | Lift +44.8%
Wed         : No Promo € 5,610 | Promo € 7,678 | Lift +36.9%

Overall Impact:
Average lift from promotions: +38.8%
Additional revenue per day: €2,302

================================================================================

2. STATE HOLIDAY IMPACT ANALYSIS
----------------------------------------
State Holiday Impact Analysis - Sales by Day:
======================================================================
Overall Holiday Impact:
-------------------------
Public         : € 8,487 (+21.8% vs regular)
Easter         : € 9,888 (+41.9% vs regular)
Christmas      : € 9,744 (+39.9% vs regular)
Regular Days: € 6,967 (baseline)

Holiday Impact by Day:
-----------------------------------
Fri         : Regular € 7,080 | Holiday € 6,879 | Impact -2.8%
Mon         : Regular € 8,231 | Holiday € 8,049 | Impact -2.2%
Sat         : Regular € 5,895 | Holiday € 5,732 | Impact -2.8%
Sun         : Regular € 8,191 | Holiday € 8,406 | Impact +2.6%
Thu         : Regular € 6,783 | Holiday € 6,407 | Impact -5.6%
Tue         : Regular € 7,103 | Holiday € 6,939 | Impact -2.3%
Wed         : Regular € 6,738 | Holiday € 6,396 | Impact -5.1%

Store Operations Impact:
-------------------------
Total store closures: 168,494 (17.1%)
Holiday closures: 39,994
Regular closures: 128,500

================================================================================

3. TOP STORE PERFORMANCE ANALYSIS
----------------------------------------
Top 10 Store Performance Analysis:
=======================================================
Rank Store      Average         % of #1
-------------------------------------------------------
1    817        €   21,789  100.0%
2    262        €   20,684   94.9%
3    1114       €   20,576   94.4%
4    251        €   19,158   87.9%
5    842        €   18,472   84.8%
6    513        €   18,212   83.6%
7    562        €   17,985   82.5%
8    788        €   17,968   82.5%
9    383        €   17,293   79.4%
10   756        €   16,575   76.1%

Summary Statistics:
Total stores analyzed: 1,115
Top 10 average: €18,871
Overall average: €6,931
Top 10 outperform by: 172.3%

================================================================================

4. TEMPORAL TRENDS ANALYSIS
----------------------------------------
Month Performance Analysis:
==================================================
Rank Month        Average         Std Dev    Count
------------------------------------------------------------
1    Dec          €    8,609  € 4,094  50,393
2    Nov          €    7,189  € 3,010  51,401
3    May          €    7,107  € 3,008  80,099
4    Apr          €    7,047  € 3,138  81,726
5    Jun          €    7,001  € 3,145  82,571
6    Mar          €    6,977  € 3,124  85,975
7    Jul          €    6,910  € 3,038  55,388
8    Aug          €    6,649  € 2,902  54,411
9    Oct          €    6,603  € 2,815  53,291
10   Feb          €    6,589  € 2,871  80,239
11   Jan          €    6,564  € 2,849  86,335
12   Sep          €    6,547  € 2,889  52,321

Key Insights:
Best month: Dec (€8,609)
Worst month: Sep (€6,547)
Performance range: €2,062
Volatility: 29.5%
print("✅ Data Visualization Impact Analysis completed.\n")
✅ Data Visualization Impact Analysis completed.
print("✅ Feature Engineering and Data Visualization (I) completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
✅ Feature Engineering and Data Visualization (I) completed successfully!
🗓️ Analysis Date: 2025-08-16 01:02:34
🌟 Advantages¶
Reusable across 'year', 'month', 'dayofweek', etc.
Easy to change aggregation type ('sum', 'median', etc.)
Consistent naming and sorting
Makes your code far more modular for dashboards or reporting
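For instance, changing the aggregation is one keyword away when the helper takes it as a parameter. A sketch of such a reusable helper (the function and `demo` data are illustrative, not from the notebook):

```python
import pandas as pd

def grouped_metric(df, group_col, target_col="sales", agg="mean"):
    """Reusable aggregation: swap agg for 'sum', 'median', etc."""
    return (df[df[target_col] > 0]          # same closed-store filter as above
            .groupby(group_col)[target_col]
            .agg(agg)
            .sort_values(ascending=False)
            .rename(f"{agg}_{target_col}")) # consistent naming for downstream use

demo = pd.DataFrame({"month": ["Jan", "Jan", "Feb"], "sales": [10, 30, 25]})
mean_by_month = grouped_metric(demo, "month")              # Jan: 20, Feb: 25
total_by_month = grouped_metric(demo, "month", agg="sum")  # Jan: 40, Feb: 25
```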
Why Reusability Matters¶
- 💡 Scalability: You can plug your functions into larger pipelines or production environments without rewrites.
- 🛠️ Maintainability: A bug fix in one utility can instantly improve multiple workflows.
- 🚀 Efficiency: Spend less time rewriting logic and more time interpreting results.
Why This Matters for Rossmann Store Sales¶
- We’ll likely repeat the same aggregations or visualizations across hundreds of stores.
- Promos, holidays, and weekday patterns demand consistent filtering and analysis.
- Modular functions help you prototype insights fast, scale across stores, and iterate smoothly.
# End analysis
viz_impact_analysis_end = pd.Timestamp.now()
duration = viz_impact_analysis_end - viz_impact_analysis_begin
# Final summary print
print("\n📋 Feature Engineering & Data Viz Summary")
print(f"🟢 Begin Date: {bold_start}{viz_impact_analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date: {bold_start}{viz_impact_analysis_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration: {bold_start}{str(duration)}{bold_end}")
📋 Feature Engineering & Data Viz Summary
🟢 Begin Date: 2025-08-16 01:02:27
✅ End Date: 2025-08-16 01:02:34
⏱️ Duration: 0 days 00:00:07.113379
Project Design Rationale: Notebook Separation¶
To promote clarity, maintainability, and scalability, the project adopts a modular notebook architecture. Each phase of the data workflow—or lifecycle step—is intentionally organized into its own distinct notebook. This modular approach prevents the accumulation of excessive code in a single notebook, making it easier to debug, update, and collaborate across different stages of the workflow. By isolating data transformation logic from visual analysis, each notebook remains focused and purpose-driven, ultimately enhancing the overall efficiency and readability of the project.